Zipf law in the popularity distribution of chess openings 

We perform a quantitative analysis of extensive chess databases and show that the frequencies of
opening moves are distributed according to a power law with an exponent that increases linearly
with the game depth, whereas the pooled distribution of all opening weights follows Zipf's law with
a universal exponent. We propose a simple stochastic process that is able to capture the observed
playing statistics and show that the Zipf law arises from the self-similar nature of the game tree of
chess. Thus, in the case of hierarchical fragmentation the scaling is truly universal and independent
of a particular generating mechanism. Our findings are of relevance in general processes with
composite decisions.

Decision making refers to situations where individuals have to select a course of action among
multiple alternatives [1]. Such processes are ubiquitous, ranging from one's personal life to
business, management and politics, and play a large part in shaping our life and society.
Decision making is an immensely complex process and, given the number of factors that influence
each choice, a quantitative understanding in terms of statistical laws remains a difficult and
often elusive goal. Investigations are complicated by the shortage of reliable data sets, since
information about human behavior is often difficult to quantify and not easily available in
large numbers, whereas decision processes typically involve a huge space of possible courses of
action. Board games, such as chess, provide a well-documented case where the players in turn
select their next move among a set of possible game continuations that are determined by the
rules of the game.

Human fascination with the game of chess is long-standing and pervasive [2], not least due to the
sheer infinite richness of the game. The total number of different games that can be played, i.e.,
the game-tree complexity of chess, has roughly been estimated as the average number of legal moves
in a chess position to the power of the length of a typical game, yielding the Shannon number
$30^{80} \approx 10^{120}$ [3]. Obviously only a small fraction of all possible games can be
realized in actual play. But even during the first moves of a game, when the game complexity is
still manageable, not all possibilities are explored equally often. While the history of
successful initial moves has been classified in opening theory [4], not much is known about the
mechanisms underlying the formation of fashionable openings [5]. With the recent appearance of
extensive databases, playing habits have become accessible to quantitative analysis, making chess
an ideal platform for analyzing human decision processes.

The set of all possible games can be represented by a directed graph whose nodes are game
situations and whose edges correspond to legal moves from each position (Fig. 1). Every opening
is represented by its move sequence as a directed path starting from the initial node. We will
differentiate between two game situations if they are reached by different move sequences. This
way the graph becomes a game tree, and each node $\sigma$ is uniquely associated with an opening
sequence.

Using a chess database [6] we can measure the popularity $n_\sigma$, or weight, of every opening
sequence as the number of occurrences in the database. We find that the weighted game tree of
chess is self-similar and that the frequencies $S(n)$ of weights follow a Zipf law

$$S(n) \sim n^{-\alpha} \qquad (1)$$

with universal exponent $\alpha = 2$. Note the precise scaling in the histogram of weight
frequencies $S(n)$ and in the cumulative distribution $C(n)$ over the entire observable range
(Fig. 2A). Similar power-law distributions with universal exponent have been identified in a
large number of natural, economic and social systems
[7, 8, 9, 10, 11, 12, 13, 14, 15], a fact which has come
to be known as the Zipf or Pareto law [7], [8]. If we count only the frequencies $S_d(n)$ of
opening weights $n_\sigma$ after the first $d$ moves, we still find broad distributions consistent
with power-law behavior $S_d(n) \sim n^{-\alpha_d}$ (Fig. 2B). The exponents $\alpha_d$ are not
universal, however, but increase linearly with $d$ (Fig. 2B, inset). The results are robust:
similar power laws could be observed in different databases and other board games, regardless of
the considered game depth, constraints on player levels or the decade when the games were played.
Stretching over six orders of magnitude, the distributions reported here are among the most
precise examples of power laws known today in social data sets.
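
To make the definitions of the weights $n_\sigma$ and the histograms $S(n)$ and $S_d(n)$ concrete, the following minimal sketch counts opening prefixes in a list of games. It is an illustration only and not the procedure used for the database analysis: the function names `opening_weights` and `weight_histogram` and the four toy games are hypothetical.

```python
from collections import Counter

def opening_weights(games, max_depth):
    """Count how often each opening (move-sequence prefix) occurs.

    games: iterable of move lists, e.g. ["d4", "Nf6", "c4", "e6"].
    Returns a Counter mapping every prefix of length 1..max_depth
    (as a tuple) to its weight n_sigma.
    """
    weights = Counter()
    for moves in games:
        for d in range(1, min(len(moves), max_depth) + 1):
            weights[tuple(moves[:d])] += 1
    return weights

def weight_histogram(weights, depth=None):
    """Return S(n) pooled over all depths, or S_d(n) if `depth` is given."""
    hist = Counter()
    for prefix, n in weights.items():
        if depth is None or len(prefix) == depth:
            hist[n] += 1
    return hist

# Toy data (hypothetical, not taken from any database).
games = [
    ["d4", "Nf6", "c4", "e6"],
    ["d4", "Nf6", "c4", "g6"],
    ["d4", "d5", "c4", "e6"],
    ["e4", "c5", "Nf3", "d6"],
]
weights = opening_weights(games, max_depth=4)
S = weight_histogram(weights)            # pooled histogram S(n)
S2 = weight_histogram(weights, depth=2)  # depth-resolved histogram S_2(n)
```

Because two games are distinguished whenever their move sequences differ, keying the counter on the full prefix tuple reproduces the tree structure without building explicit node objects.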

As seen in Fig. 1, for each node $\sigma$ the weights of its subtrees define a partition of the
integers $1, \dots, n_\sigma$. The assumption of self-similarity implies a statistical equivalence
of the branching in the nodes of the tree. We can thus define the branching ratio distribution over
the real interval $r \in [0, 1]$ by the probability $Q(r|n)$ that a random pick from the numbers
$1, \dots, n$ is in a subset of








Figure 1: A) Schematic representation of the weighted game tree of chess based on the ScidBase [6]
for the first three half moves. Each node indicates a state of the game. Possible game
continuations are shown as solid lines together with the branching ratios $r_d$. Dotted lines
symbolize other game continuations, which are not shown. B) Alternative representation emphasizing
the successive segmentation of the set of games, here indicated for games following a 1.d4 opening
until the fourth half move $d = 4$. Each node $\sigma$ is represented by a box of a size
proportional to its frequency $n_\sigma$. In the subsequent half move these games split into
subsets (indicated vertically below) according to the possible game continuations. Highlighted in
(A) and (B) is a popular opening sequence 1.d4 Nf6 2.c4 e6 (Indian Defense).



size smaller or equal to $rn$. Taking $n$ to infinity, $Q(r|n)$ may have a continuous limit $Q(r)$
for which we find the probability density function (pdf) $q(r) = Q'(r)$. If the limit distribution
$q(r)$ of branching ratios exists, it carries the fingerprint of the generating process. For
instance, the continuum limit of the branching ratio distribution for a Yule-Simon preferential
growth process [13] in each node of the tree would be $q(r) \sim r^{\beta}$, where $\beta < 0$ is a
model-specific parameter. On the other hand, in a $k$-ary tree where each game continuation has a
uniformly distributed random a priori probability, the continuum limit corresponds to a random
stick-breaking process in each node, yielding $q(r) \sim (1-r)^{k-2}$. For the weighted game tree
of chess, $q(r)$ can be measured directly from the database (Fig. 3A). We find that $q(r)$ is
remarkably constant over most of the interval but diverges with exponent 0.5 as $r \to 1$, and is
very well fitted by the parameterless arcsine distribution

$$q(r) = \frac{2}{\pi \sqrt{1 - r^2}}. \qquad (2)$$

The form of the branching ratio distribution suggests that in the case of chess there is no
preferential growth process involved, but something entirely different, which must be rooted in
the decision process during the opening moves of a chess game.
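
Branching ratios with an arcsine density can be generated by inverse-transform sampling. The sketch below assumes the form $q(r) = 2/(\pi\sqrt{1-r^2})$ on $[0, 1]$, whose cumulative distribution is $Q(r) = (2/\pi)\arcsin r$, so that $r = \sin(\pi u/2)$ for uniform $u$. The function name `sample_branching_ratio`, the seed and the sample size are illustrative choices.

```python
import math
import random

def sample_branching_ratio(rng):
    """Draw r in [0, 1] with density q(r) = 2 / (pi * sqrt(1 - r^2)).

    Inverse-transform sampling: Q(r) = (2/pi) * asin(r), hence
    r = sin(pi * u / 2) for u uniform on [0, 1).
    """
    return math.sin(0.5 * math.pi * rng.random())

rng = random.Random(0)
samples = [sample_branching_ratio(rng) for _ in range(200_000)]

# Two simple checks against exact values of the assumed density:
# the mean is 2/pi and the probability of r <= 1/2 is Q(1/2) = 1/3.
mean_r = sum(samples) / len(samples)
frac_below_half = sum(s <= 0.5 for s in samples) / len(samples)
print(mean_r, 2 / math.pi)
print(frac_below_half, 1 / 3)
```

A histogram of `samples` with bin width 0.01 can be compared with the measured density in the same way as the database values.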

In the following we show that the asymptotic Zipf law in the weight frequencies arises
independently of the specific form of the distribution $q(r)$ and, hence, of the microscopic rules
of the underlying branching process. Consider $N$ realizations of a general self-similar random
segmentation process of $N$ integers, with paths $(\sigma_0, \sigma_1, \dots)$ in the
corresponding weighted tree. In the context of chess each realization of this process corresponds
to a

Figure 2: A) Histogram of weight frequencies $S(n)$ of openings up to $d = 40$ in the Scid database
(brown dots) and with logarithmic binning (blue). A straight-line fit (not shown) yields an
exponent of $\alpha = 2.05$ with a goodness of fit $R^2 > 0.9992$. For comparison, the Zipf
distribution Eq. (8) with $\mu = 1$ is indicated as a solid line. Inset: number
$C(n) = \sum_{m=n+1}^{N} S(m)$ of openings with a popularity $m > n$. $C(n)$ follows a power law
with exponent 1.04 ($R^2 = 0.994$). B) Number $S_d(n)$ of openings of depth $d$ with a given
popularity $n$ for $d = 16$ (brown dots) and histograms with logarithmic binning for $d = 4$
(blue), $d = 16$ (red) and $d = 22$ (black). Solid lines are regression lines to the
logarithmically binned data ($R^2 > 0.99$ for $d < 35$). Inset: slope $\alpha_d$ of the regression
line as a function of $d$ (red dots) and the analytical estimation Eq. (6) using
$N = 1.4 \cdot 10^6$ and $\beta = 0$ (black solid line).



random game from the database of $N$ games (e.g., dark shading in Fig. 1). The weights
$n_d \equiv n_{\sigma_d}$ describe a multiplicative random process

$$n_d = N \prod_{i=1}^{d} r_i, \qquad n_0 = N, \qquad (3)$$

where the branching ratios $r_d = n_d / n_{d-1}$ for sufficiently large $n_d$ are distributed
according to $q(r)$, independent of $d$. For lower values of $n_d$ the continuous branching ratio
distribution is no longer a valid approximation, and a node of weight one has at most one subtree,
i.e., the state $n_d = 1$ is absorbing.

To calculate the probability density function (pdf) $p_d(n)$ of the random variable $n_d$ after
$d$ steps, it is convenient to consider the log-transformed variables $\nu = \log(N/n)$ and
$\rho = -\log r$. The corresponding process $\{\nu_d\}$ is a random walk
$\nu_d = \sum_{i=1}^{d} \rho_i$ with non-negative increments $\rho_i$, and its pdf $\pi_d(\nu)$
transforms as $n\, p_d(n) = \pi_d(\nu)$. An analytic solution can be obtained for the class

$$q(r) = (1 + \beta)\, r^{\beta}, \qquad 0 < r < 1, \qquad (4)$$

of power-law distributions, which typically arise in preferential attachment schemes. In this case
the jump process $\nu_d$ is Poissonian and distributed according to a gamma distribution
$\pi_d(\nu) = \frac{(1+\beta)^d}{(d-1)!} \nu^{d-1} e^{-(1+\beta)\nu}$. After retransformation to
the original variables, and noting that from the probability $p_d(n)$ for a single node at distance
$d$ from the root to have the weight $n$ one obtains the expected number $S_d(n)$ of these nodes in
$N$ realizations of the random process as $S_d(n) = N p_d(n)/n$, we find in particular

$$S_d(n) = \frac{(1+\beta)^d}{N (d-1)!} \left( \log \frac{N}{n} \right)^{d-1} \left( \frac{n}{N} \right)^{-(1-\beta)}. \qquad (5)$$
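
The gamma form of $\pi_d(\nu)$ can be checked by direct sampling. The sketch below assumes the power-law class $q(r) = (1+\beta) r^{\beta}$, for which $Q(r) = r^{1+\beta}$ and hence $r = u^{1/(1+\beta)}$ for uniform $u$; the increments $\rho = -\log r$ are then exponential with rate $1+\beta$, and $\nu_d$ should have mean $d/(1+\beta)$ and variance $d/(1+\beta)^2$. The parameter values and function names are illustrative.

```python
import math
import random

def sample_ratio(beta, rng):
    """Draw r from q(r) = (1 + beta) * r**beta on (0, 1] by inverse transform."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so that log(r) stays finite
    return u ** (1.0 / (1.0 + beta))

def sample_nu(d, beta, rng):
    """Random-walk position nu_d = sum of d increments rho_i = -log(r_i)."""
    return sum(-math.log(sample_ratio(beta, rng)) for _ in range(d))

rng = random.Random(1)
beta, d, trials = -0.5, 10, 50_000
nus = [sample_nu(d, beta, rng) for _ in range(trials)]

mean_nu = sum(nus) / trials
var_nu = sum((x - mean_nu) ** 2 for x in nus) / trials

# Gamma distribution with shape d and rate (1 + beta):
print(mean_nu, d / (1.0 + beta))
print(var_nu, d / (1.0 + beta) ** 2)
```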



The functions $S_d(n)$ are strongly skewed and can exhibit power-law-like scaling over several
decades. A logarithmic expansion for $1 \le n \ll N$ shows that they approximately follow a scaling
law $S_d(n) \sim n^{-\alpha_d}$ with exponent

$$\alpha_d = (1 - \beta) + \frac{1}{\log N}(d - 1). \qquad (6)$$



The exponent $\alpha_d$ increases linearly with the game depth $d$, with a logarithmic finite-size
correction, which is in excellent agreement with the chess database (Fig. 2B, inset). Power laws
in the stationary distribution of random segmentation and multiplicative processes have been
reported before [16] and can be obtained by introducing slight modifications, such as reflecting
boundaries, frozen segments, merging or reset events [17], [18]. In contrast, the approximate
scaling of $S_d(n)$ in Eq. (5) is fundamentally different, as our process does not admit a
stationary distribution. The exponents $\alpha_d$ increase due to the finite size of the database.
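
The relation between Eq. (5) and the exponent in Eq. (6) can be explored numerically. The following sketch evaluates $S_d(n)$ in logarithmic form (to avoid overflow in the factorial) and compares its local log-log slope with $\alpha_d$. The helper names and the finite-difference step are illustrative; the sketch only shows that the slope equals Eq. (6) at $n = 1$ and becomes steeper as $n$ grows, which is why the scaling is approximate.

```python
import math

def S_d(n, d, N, beta=0.0):
    """Expected number of depth-d openings with weight n, Eq. (5)."""
    log_S = (d * math.log1p(beta) - math.log(N) - math.lgamma(d)  # lgamma(d) = log((d-1)!)
             + (d - 1) * math.log(math.log(N / n))
             - (1.0 - beta) * math.log(n / N))
    return math.exp(log_S)

def alpha_d(d, N, beta=0.0):
    """Scaling exponent from Eq. (6)."""
    return (1.0 - beta) + (d - 1) / math.log(N)

def local_slope(n, d, N, beta=0.0, eps=1e-3):
    """Central-difference estimate of -d log S_d / d log n at weight n."""
    hi, lo = n * (1.0 + eps), n * (1.0 - eps)
    d_log_S = math.log(S_d(hi, d, N, beta)) - math.log(S_d(lo, d, N, beta))
    return -d_log_S / (math.log(hi) - math.log(lo))

N = 1.4e6
for d in (4, 16, 22):
    print(d, alpha_d(d, N), local_slope(1.0, d, N), local_slope(100.0, d, N))
```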

As shown in Fig. 3B, we find excellent agreement between the weight frequencies $S_d(n)$ in the
chess database and direct simulations of the multiplicative process, Eq. (3), using the arcsine
distribution, Eq. (2). If the branching ratios are approximated by a uniform distribution
$q(r) = 1$, the predicted values of $S_d(n)$ are systematically too small, since a uniform
distribution yields a larger flow into the absorbing state $n_d = 1$ than observed in the
database. Still, due to the asymptotic behavior of $q(r)$ for $r \to 0$, this approximation yields
the correct slope in the log-log plot, so that the exponent $\alpha_d$ can be estimated quite well
based on Eq. (6) with $\beta = 0$.
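
One possible way to set up such a direct simulation is sketched below. The details are assumptions and are not taken from the text: each realization follows a single path, the integer weight is obtained by rounding $r\, n_{d-1}$ up, the state $n = 1$ is treated as absorbing, and $S_d(n)$ is estimated from the path statistics via $S_d(n) = N p_d(n)/n$. The arcsine sampler assumes the density $q(r) = 2/(\pi\sqrt{1-r^2})$, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def draw_uniform(rng):
    """Branching ratio from the uniform density q(r) = 1."""
    return rng.random()

def draw_arcsine(rng):
    """Branching ratio from q(r) = 2 / (pi * sqrt(1 - r^2)) by inverse transform."""
    return math.sin(0.5 * math.pi * rng.random())

def final_weight(N, depth, draw_ratio, rng):
    """Follow one path for `depth` steps and return its weight n_d."""
    n = N
    for _ in range(depth):
        if n == 1:
            break  # absorbing state: a node of weight one cannot split further
        n = max(1, math.ceil(draw_ratio(rng) * n))  # assumed discretization
    return n

def estimate_S_d(N, depth, draw_ratio, rng, realizations=20_000):
    """Estimate S_d(n) = N * p_d(n) / n from sampled path weights."""
    counts = Counter(final_weight(N, depth, draw_ratio, rng)
                     for _ in range(realizations))
    return {n: N * (c / realizations) / n for n, c in counts.items()}

rng = random.Random(42)
N, depth = 1_400_000, 22
S_uniform = estimate_S_d(N, depth, draw_uniform, rng)
S_arcsine = estimate_S_d(N, depth, draw_arcsine, rng)

# Number of depth-22 openings that have collapsed to weight one in each case.
print(S_uniform[1], S_arcsine[1])
```

Dividing the path counts by $n$ accounts for the fact that a node of weight $n$ is traversed by $n$ of the $N$ games, so that $\sum_n n\, S_d(n) = N$ holds by construction.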

By observing that $S_d(n)$ in Eq. (5) is the $d$-th term in a series expansion of an exponential
function, we find the weight distribution in the whole game tree, $S(n) = \sum_d S_d(n)$, to be an
exact Zipf law, $S(n) = (1+\beta) N / n^2$. For branching ratio distributions $q(r)$ different
from Eq. (4) the



Figure 3: A) Probability density $q(r)$ of branching ratios $r$ sampled from all games in the Scid
database with a bin size of $\Delta r = 0.01$ (red bars) and arcsine distribution Eq. (2) (black
solid line). Every edge of the weighted game tree, from nodes of size $n_{d-1}$ to $n_d$,
contributes to the bin corresponding to $r = n_d / n_{d-1}$ with weight $r$. We disregarded
clusters with $n_d < 100$ so that, in principle, a cluster could contribute to any of the bins. We
found $q(r)$ to be depth-independent. B) Distribution of opening popularities $S_d(n)$ for
$d = 22$ obtained from the Scid database (black) and from a direct simulation of the multiplicative
process Eq. (3), with branching ratios taken from a uniform (red) or arcsine (blue) distribution
$q(r)$. Further indicated is the theoretical result Eq. (5) (dashed line). Similar results are
obtained for other values of $d$.



weight frequencies are difficult to obtain analytically. But using renewal theory [19] the scaling
can be shown to hold asymptotically for $n \ll N$ and a large class of distributions $q(r)$. For
this, note that the random variable $\tau(\nu) = \max\{d : \nu_d < \nu\}$ is a renewal process in
$\nu$. The expectation $E[\tau(\nu)]$ is the corresponding renewal function, related to the
distributions of the $\nu_d$ as $\sum_{d=1}^{\infty} \mathrm{Prob}(\nu_d < \nu) = E[\tau(\nu)]$. If
the expected value $\mu = E[\rho] = E[-\log r]$ is finite and positive (e.g., for the distribution
(4), $\mu = 1/(1+\beta)$), the renewal theorem provides

$$\lim_{\nu \to \infty} \frac{d}{d\nu} E[\tau(\nu)] = \frac{1}{\mu}. \qquad (7)$$

Thus, we obtain $\lim_{\nu \to \infty} \sum_{d=1}^{\infty} \pi_d(\nu) = 1/\mu$ and finally

$$\lim_{\nu \to \infty} S(n) = \frac{N}{\mu n^2}. \qquad (8)$$
[in 2 




Figure 4: Inequality [Til I20I ] of the distribution $S_d(n)$. A) Proportion $W$ of games that is
concentrated in the fraction $Q$ of the most popular openings, for several levels of the game
depth $d$. B) $Q$ as a function of $d$ for three different values of $W$ (solid lines) and Gini
coefficient $G = 1 - 2 \int_0^1 Q(W)\, dW$ as a function of game depth (dotted line).

Thus, the multiplicative random process (Eq. 3) with any well-behaved branching ratio distribution
$q(r)$ on the interval $[0, 1]$ always leads to an asymptotically universal scaling for $n \ll N$
(compare also the excellent fit of Eq. (8) to the chess data in Fig. 2A). In [lj| the same Zipf-law
scaling was found for the sizes of the directory trees in a computer cluster. The authors propose a
growth mechanism based on linear preferential attachment. Here we have shown that the exponent
$\alpha = 2$ for the weight distribution of subtrees in a self-similar tree is truly universal in
the sense that it is the same for a much larger class of generating processes and not restricted
to preferential attachment or growth.
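
For the power-law class of Eq. (4) the universality can be verified directly: summing Eq. (5) over all depths $d$ should reproduce $N/(\mu n^2)$ with $\mu = 1/(1+\beta)$, whatever the value of $\beta$. The sketch below performs this sum numerically; the truncation depth `d_max` and the helper names are illustrative assumptions.

```python
import math

def S_d(n, d, N, beta):
    """Expected number of depth-d openings with weight n, Eq. (5)."""
    log_S = (d * math.log1p(beta) - math.log(N) - math.lgamma(d)  # lgamma(d) = log((d-1)!)
             + (d - 1) * math.log(math.log(N / n))
             + (beta - 1.0) * math.log(n / N))
    return math.exp(log_S)

def S_pooled(n, N, beta, d_max=300):
    """Sum of S_d(n) over depths 1..d_max (the terms decay factorially)."""
    return sum(S_d(n, d, N, beta) for d in range(1, d_max + 1))

def zipf(n, N, beta):
    """Zipf law N / (mu * n^2) with mu = 1 / (1 + beta)."""
    mu = 1.0 / (1.0 + beta)
    return N / (mu * n ** 2)

N = 1.4e6
for beta in (0.0, -0.5):
    for n in (1, 10, 1000):
        print(beta, n, S_pooled(n, N, beta), zipf(n, N, beta))
```

The depth-resolved terms individually depend on $\beta$ and $d$, but their sum collapses onto the same $n^{-2}$ law, with $\beta$ entering only through the prefactor $1/\mu$.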

There are direct implications of our theory for general composite decision processes, where each
action is assembled from a sequence of $d$ mutually exclusive choices. What corresponds to an
opening sequence in chess may be a multivariate strategy or a customized ordering in other
situations. The question of how such strategies are distributed is important for management and
marketing [UJ]. One consequence of our theory is that in a process of $d$ composite decisions the
distribution $S_d(n) \sim n^{-\alpha_d}$ of decision sequences, or strategies, which occur $n$
times shows a transition from low exponents $\alpha_d < 2$, where a few strategies are very common,
to higher exponents $\alpha_d > 2$, where individual strategies dominate. This is due to the
divergence of the first moment in power laws with exponents smaller than two [UJ]. From Eq. (6),
the critical number $d_{cr}$ of decisions at which this transition occurs depends logarithmically
on the sample size $N$ and on the leading order $\beta$ of $q(r)$ near zero as

$$d_{cr} = 1 + (1 + \beta) \log N. \qquad (9)$$

Applied to the chess database with $N = 1.4 \cdot 10^6$ we obtain $d_{cr} \approx 15$ (see also
Fig. 4 and Fig. 2B, inset). This separates the database into two very different regimes: in their
initial phase ($d < d_{cr}$) the majority of chess games are distributed among a small number of
fashionable openings (for $d = 12$, for example, 80% of all games in the database are concentrated
in about 23% of the most popular openings), whereas beyond the critical game depth rarely used move
sequences dominate, such that in aggregate they comprise the majority of all games (Fig. 4). Note
that this result arises from the statistics of iterated decisions and does not indicate a crossover
of playing behavior with increasing game depth.
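
The value of the critical depth follows from setting $\alpha_d = 2$ in Eq. (6). As a short worked example, assuming the natural logarithm and $\beta = 0$ as used for the chess data, the arithmetic can be reproduced as follows (function names are illustrative):

```python
import math

def alpha_d(d, N, beta=0.0):
    """Scaling exponent from Eq. (6)."""
    return (1.0 - beta) + (d - 1) / math.log(N)

def critical_depth(N, beta=0.0):
    """Depth at which alpha_d reaches 2, Eq. (9)."""
    return 1.0 + (1.0 + beta) * math.log(N)

N = 1.4e6
d_cr = critical_depth(N)
print(d_cr)              # close to 15 for the chess database size
print(alpha_d(d_cr, N))  # equals 2 up to rounding error
```

Because $d_{cr}$ grows only logarithmically with $N$, even a tenfold larger database would shift the crossover by little more than two half moves.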

Our study suggests the analysis of board games as a promising new perspective for statistical
physics. The enormous amount of information contained in game databases, with its evolution
resolved in time and in relation to an evolving network of players, provides a rich environment to
study the formation of fashions and collective behavior in social systems.

We are indebted to Andriy Bandrivskyy for invaluable 
help with the data analysis. 